Cream of the Crop 26

Text File | 1997-07-31 | 49KB | 987 lines

HTMSTRIP.DOC                                                Jul 31, 1997

WIN95 AND WINNT NOTICE: As with most DOS-based utilities, this program doesn't understand the weird subdirectories, long filenames, and invalid characters that are possible under Windows 95 and Windows/NT. Both operating systems alias long filenames into names like MYFILE~1.TXT and you will need to specify the aliased versions of file names to process them. Under some file structure systems in NT, the program may not work at all.

The HTMSTRIP.EXE program attempts to read HTML pages, remove the HTML coding, and write the file out as something more useful.

Features of this program:

  * Ideal way to prep HTML documents for later retransmission via e-mail (which doesn't support the fonts, pictures, etc). Beats out Netscape's Save As Text option hands down.

  * Can be run across an entire subdirectory (for example, your entire cache subdirectory), and will only process the HTML documents that it finds. (There are some options on this.)

  * Removes all embedded HTML commands.

  * Recodes the standard HTML "entity references" (so "&copy;" becomes "(c)"). The actual replacements are coded in a user-modifiable lookup file.

  * Handles standard indent, heading, selection groups, menus, tables, etc.

  * Reflows all text as appropriate.

  * Can provide character-translation table to filter out characters that only work under Windows.

  * Can indicate bolding, underlining, etc with user-specified characters.

  * Optionally, will replace Link, Image, and Input references with user-definable text representations.

  * Optionally, alerts you to possible errors in the HTML code itself.

  * Supports ISO 8859/1 8-bit single-byte (Windows), 7-bit DOS ASCII, and 8-bit DOS ASCII character sets.

  * Optionally creates a logfile of file activity.

  * Pressing escape stops the program early.

HTML codes are surrounded within <...> indicators. For upward compatibility reasons, Web browsers ignore any codes that they don't understand and just process the ones they can handle.
HTMSTRIP removes all HTML codes. It also handles the standard HTML "&xxx;" "entity references" (for example, "&copy;" is replaced by "(c)"). You can add or change these replacements as desired by using the INI file (documented later).

Quickie instructions:

Okay! You hate to read. I know that. And there aren't any cute pictures in this documentation and, like everything I write, it's way too long to keep your attention for long. So, let's bottom line it; what's the quickest way to use this program without learning any of the options?

Let's presume you're running under Windows. Take the HTMSTRIP.EXE and HTMSTRIP.INI files from the HTMSTymm.ZIP file and copy them to the same subdirectory somewhere. (They should be in the same subdirectory already since that's how uncompressing them would have gone.) This subdirectory should be in your path. If you're not sure what your path is, hop to DOS and type "SET". There should be a line shown that says something like "PATH=C:\;C:\DOS;C:\WINDOWS". I wouldn't advise copying HTMSTRIP.EXE and HTMSTRIP.INI to your WINDOWS subdirectory. Maybe your root?

Get on the Web and save the source of an HTML document to your hard disk. This is done from the Netscape Navigator by bringing up a page and saying "Save As...". Remember the file name and what subdirectory you saved the document to. Just for example's sake, let's say the file name is "UPEPIS.HTM".

Hop to DOS. (You can run HTMSTRIP from the Run option in Windows but it's easier to explain this way.) Make the directory where you saved the document your default subdirectory. (This is usually done with a series of "CD" commands in DOS.) Now, type:

     HTMSTRIP

You didn't pass in any parameters so HTMSTRIP will request the name of the file to process. Enter the name of the HTML file. In our case, this would appear like:

     Enter filespec to process? UPEPIS.HTM

Presuming you did everything correctly, the HTMSTRIP program will read the HTML file and tell you it created a new file with the file extension of ".OUT" (in our case, "UPEPIS.OUT").

That was pretty easy. Now, hop back into Windows and bring the new file up in your text editor (use Write or something else that uses TrueType fonts instead of NotePad). With luck, you'll see the file looking similar to how it did when you were viewing it under your Web browser. The difference is that it's now a properly-formatted text document which fits on the screen and can be e-mailed to someone.

Hop back into DOS. Type "HTMSTRIP /?". You'll see there are a bunch of other parameters that you can pass in. If you're not pleased with the output file that was created, you might want to read the quick on-screen description of each option and then consult the HTMSTRIP.DOC file for more instructions about anything that sounds interesting. Chances are, you won't want to revise any of the system defaults at least initially.

If you find yourself consistently needing to change some options, you might want to edit the HTMSTRIP.INI file to specify those new defaults. Read the BRUCEINI.DOC file for information on overriding defaults.

Note that the instructions tell you you can use wildcards for the input file name. You can do something like "HTMSTRIP *.HTM" and it will process every file with an ".HTM" extension in your default subdirectory.

HTML codes:

HTMSTRIP is also tuned to allow it to specially-handle several embedded HTML codes found through HTML version 3.2.
These codes are the following:

  Supported Element              Attributes     Description
  <!-- ... -->                                  Comments (skip)
  <A ...>...</A>                                External link
                                 HREF=site      Start of hypertext link
                                 ID=anchor      Establishes target for hypertext links
                                 NAME=anchor    Establishes target for hypertext links
  <AREA>                                        Client-side image hotspot
                                 HREF=site      Hypertext link
                                 ALT=text       What to display if text-only environment
  <B>..</B>                                     Bold text
  <BASE ...>                                    Establishing a root directory
                                 HREF=site      Prefix to add to unqualified sites
  <BLOCKQUOTE>...</BLOCKQUOTE>                  Indented block of text
  <BR>                                          Forced line break
  <CAPTION>...</CAPTION>                        Title for a table block
  <CENTER>...</CENTER>                          Centering text block
  <DD>                                          Term definition
  <DIR>...</DIR>                                Directory list of items (obsolete)
  <DL>...</DL>                                  Definition list block
  <DT>                                          First term of definition list/glossary
  <EM>...</EM>                                  Emphasize text
  <H1> to <H6>...</H1> to </H6>                 Heading items
  <HR>                                          Horizontal rule
  <I>..</I>                                     Italicize text
  <IMG ...>                                     Image
                                 SRC=site       Location of the image
                                 ALT=text       What to display if text-only environment
  <INPUT ...>                                   User input
                                 TYPE=CHECKBOX  Type of input -- shown as [_]
                                 TYPE=HIDDEN    Type of input -- suppress
                                 TYPE=RADIO     Type of input -- shown as (_)
                                 CHECKED        Makes [X] or (X)
                                 SIZE=n         Specifies length for input fields
                                 VALUE=text     Specifies default value for input fields
  <LI>                                          Menu/Ordered/Unordered/Directory list item
  <MAP>...</MAP>                                Client-side image map
  <MENU>...</MENU>                              Menu listing block (obsolete)
  <OL>...</OL>                                  Ordered listing block (HTMSTRIP skips numbers)
  <OPTION>                                      Used for single/multiple choice menus
  <P>                                           Paragraph indicator
  <PRE>...</PRE>                                Preserve spacing block (preformatted text)
  <SCRIPT>...</SCRIPT>                          JavaScript blocks are ignored
  <SELECT>...</SELECT>                          Block for single/multiple choice menu
                                 MULTIPLE       Allow for multiple selections
  <TABLE>...</TABLE>                            Table block
  <TD>...</TD>                                  Table data (cell)
                                 ALIGN=spec     How to align the cell (default is LEFT)
                                 COLSPAN=n      How many columns to span with this cell
                                 ROWSPAN=n      How many rows to span with this cell
  <TH>...</TH>                                  Table heading
                                 ALIGN=spec     How to align the cell (default is CENTER)
                                 COLSPAN=n      How many columns to span with this cell
                                 ROWSPAN=n      How many rows to span with this cell
  <TITLE>...</TITLE>                            Title item
  <TR>...</TR>                                  Table row
  <U>..</U>                                     Underlining text
  <UL>...</UL>                                  Unordered listing

If you run across other codes that become vital, let me know and I'll see about handling them somehow.

How to get HTML files:

Some people who are using regular Web browsers like Mosaic or Netscape don't realize that they're automatically saving HTML files to their hard disk throughout every Web session. That's because just about every Web browser saves the most-recently accessed files from the Web (including HTML source code, GIF's, and JPG's) on your hard disk and reads them from there instead of requiring you to download them every time you go back to a previous page.

This is typically settable by you under "Preferences" and "Cache" on your Web browser. I usually set my Web browser to have a huge cache, maybe 10MB. Anything beats downloading the same pages over again even at 28.8K. And I make sure that I do not have anything specified like "clear cache at the end of every session". Then I just go through the files in the cache subdirectory afterward and reprocess them.

Two disadvantages to a cache... It takes up hard disk space but, hey, the Web browser is typically in Windows so why are you surprised. The second disadvantage is that if the page actually changes between sessions, you typically won't notice the new page as long as it remains in your cache. If you think a page is still in cache and should have been changed but didn't, you can typically ask your Web browser to reload the page.
On some browsers, this is shown as an arrow in the form of a circle.

HTMSTRIP can process the entire cache subdirectory. It automatically detects non-HTML files for you and processes accordingly. It creates new text file versions of just the HTML pages it finds.

Another great way to get HTML pages is to use the URL-minder service at

     http://www.netmind.com/URL-minder/new/register.html

This is a free service which automatically tells you whenever a Web page's contents change. If you use the advanced features, you can have the Web page's HTML code sent to you as a file attachment (it's easier than dealing with the "embed" option too). Then you can run HTMSTRIP on the resulting file.

Specifying parameters:

Parameters for this program can be set in the following ways. The last setting encountered always wins:

  - Read from an *.INI file (see BRUCEINI.DOC file),
  - Through the use of an environmental variable (SET HTMSTRIP=whatever), or
  - From the command line (see "Syntax" below)

HTMSTRIP also allows you to define:

  - How "entity references" (things like "&reg;") are shown
  - How "symbolic references" (things like "[input]" and "<B>") are shown
  - Which characters should be filtered into other characters (things like showing "Æ" as "'" -- none of these should actually appear on Web pages by the way)

These are explained in sections at the end of this documentation.

Syntax:

  HTMSTRIP [ filespec | (filelist) | @listfile ] [ /Cpath ] [ outfile ]
           [ /EXT=.xxx ] [ /COPY=path ] [ /CREATE ] [ /ALL ]
           [ /ATTR=attribs ] [ /WIDTH=n ] [ /FORCE ] [ /RULE=s ]
           [ /BORDER=c ] [ /BUFF=n ] [ /SPACES ] [ /RSPACE ]
           [ /WARNINGS ] [ /-TABLE ] [ /-INDENT ] [ /CPn ] [ /A=spec ]
           [ /IMG=spec | /IMGALT=spec ] [ /ALTONLY ]
           [ /MAP=spec | /MAPALT=spec ] [ /-INPUT ] [ /Linitfile ]
           [ /FILTER | /FILTER=filename ] [ /LOG=logfile ] [ /Tpath ]
           [ /MONO ] [ /Iinitfile | /-I ] [ /-ENV ] [ /? ] [ /?&H ]

where:

"filespec" tells the routine which file or files are to be processed. The specification can include path and wildcards if desired. Typically, the file names are *.HTM files. If no input specification (filespec or @listfile) is provided, you'll be prompted for one. If no extension is provided, ".HTM" is presumed. (If you want to process a file which does not have an extension, include the trailing period on the file name, such as "HTMSTRIP HTTP_WWW." with the period in there.)

"(filelist)" allows you to specify multiple files to be processed from the command line. File names should be separated by spaces. They may include drive, path, and wildcard information. Remember that a command line in DOS cannot exceed 127 characters so you're limited as to how many different file specifications you can provide in this fashion.

"@listfile" allows you to have a variety of file specifications saved in a text file named "listfile". Each line in the file should consist of one file specification, each of which can include a path and wildcards if desired. Blank lines and lines beginning with semi-colons, colons, or quotes are ignored. If no input specification (filespec or @listfile) is provided, you'll be prompted for one.

"/Cpath" specifies that the cache is found in a particular subdirectory. This allows you to specify a default location in your *.INI file (see BRUCEINI.DOC) and just specify something like "A*.HTM" for the files to process. Note, however, that if you don't use *.INI files, it's easier to just pass in the input file path with the "filespec" parameter ("HTMSTRIP *.HTM /C\CACHE" and "HTMSTRIP \CACHE\*.HTM" are the same). Defaults to your current default path. If the input filespec includes drive or path information, this will override the /Cpath specification.

"outfile" is the name of the output file to create. It is overwritten without prompting if it exists already.
If no output file name is provided, the routine will use the infile and provide an extension of *.OUT. (The default .OUT extension can be overridden using the /EXT=.xxx parameter.) An outfile cannot be specified if wildcards or @listfile are used for the input file specification.

"/EXT=.xxx" allows you to specify a different default file extension for the output file. This parameter only matters if you do not explicitly specify an output file name. Initially defaults to "/EXT=.OUT".

"/COPY=path" specifies that the output files (for example, BRUCE.OUT when the input was BRUCE.HTM) are to be created in the specified subdirectory. By default, the program creates the output files in the same path as the input files. If the subdirectory does not exist, you will be prompted for whether to create it or not based on the value of the /CREATE parameter.

"/CREATE" automatically creates the output subdirectory if /COPY=path is specified. The default is "/-CREATE"; if the subdirectory is not there, the program prompts whether it should be created or not.

"/ALL" says that if the program encounters what it thinks is just a text file, it should take the file and try to fix up CR/LF problems (Unix files end with LF's instead of CR/LF which is what DOS needs) and that's it. This can be somewhat risky since it may misdiagnose a file but it should be safe if you're running it on your cache subdirectory. Initially defaults to "/-ALL" which means it won't process it unless it thinks it's an HTML file.

"/-ALL" says to skip files if the program thinks it's not an HTML file. This is initially the default.

"/ATTR=attribs" allows you to specify a combination of attributes that you want considered. You can specify any combination of R (read-only), H (hidden), S (system), or A (archive bit). Precede any character(s) with "-" to exclude instead of include.
Unlike with the DOS DIR command, the inclusions and exclusions are subject to "OR" conditions; /ATTR=HS will retrieve any file that is either hidden or a system file or both. You can specify "/ATTR=ALL" to specify that all files are to be processed. Initially defaults to /ATTR=-H-S (skip hidden or system files).

"/WIDTH=n" specifies the desired line length for wrapping long lines and also for centering. Initially defaults to "/WIDTH=80".

"/FORCE" says that the specified width must be adhered to. The only exception to this is that tables may force a width expansion if the cells simply can't fit on the page otherwise. Using /FORCE means that <PRE>...</PRE> blocks may be wrapped (typically a no-no) and some words in tables may get split up if the entire word can't fit in the computed cell width. The latter is especially a problem if there are lots of cell columns in a table; there isn't much room for the actual data when the cells themselves take up so much space. Initially defaults to "/-FORCE".

"/-FORCE" says that the desired widths can be ignored if table cells or <PRE>...</PRE> blocks would look more natural without it. This is initially the default.

"/RULE=s" specifies that a string is to be repeated the width of the line. This is used to separate sections. The string can be a single character (like "/RULE=-"), multiple characters (like "/RULE="- ""), it can contain decimal and hexadecimal characters (like "/RULE=\066\097\116"--see BRUCEHEX.DOC), it can be "/RULE=NULL" (or "/-RULE"; both typically result in a blank line), or just simply "/RULE" (which is the same thing as "/RULE=-" if /BORDER=T and "/RULE=\196" if /BORDER=S or /BORDER=D). Personally, if your printer supports IBM graphics characters, I find "/RULE=\196" to be the most pleasing of the rule lines. Initially defaults to /RULE=- .

"/BORDER=c" specifies the type of border to use.
The possible choices for "c" are:

     D  -- double line
     S  -- single line
     T  -- text character line -- this is the default
     B  -- blanks (spaces)
     N  -- none
     DV -- double line is used for vertical borders, lines are skipped in horizontal rows within the table itself
     SV -- same as DV except single line
     TV -- same as DV except text lines

Examples of the various border types:

    <D>ouble       <S>ingle        <T>ext        <B>lank     <N>one
  ╔═══╦═══╤═══╗  ┌───┬───┬───┐  +---+---+---+
  ║ 1 ║ 2 │ 3 ║  │ 1 │ 2 │ 3 │  | 1 | 2 | 3 |    1   2   3    1 2 3
  ╠═══╬═══╪═══╣  ├───┼───┼───┤  +---+---+---+                 4 5 6
  ║ 4 ║ 5 │ 6 ║  │ 4 │ 5 │ 6 │  | 4 | 5 | 6 |    4   5   6    7 8 9
  ╟───╫───┼───╢  ├───┼───┼───┤  +---+---+---+
  ║ 7 ║ 8 │ 9 ║  │ 7 │ 8 │ 9 │  | 7 | 8 | 9 |    7   8   9
  ╚═══╩═══╧═══╝  └───┴───┴───┘  +---+---+---+

      <DV>           <SV>            <TV>
  ╔═══╦═══╤═══╗  ┌───┬───┬───┐  +---+---+---+
  ║ 1 ║ 2 │ 3 ║  │ 1 │ 2 │ 3 │  | 1 | 2 | 3 |
  ╠═══╬═══╪═══╣  ├───┼───┼───┤  +---+---+---+
  ║ 4 ║ 5 │ 6 ║  │ 4 │ 5 │ 6 │  | 4 | 5 | 6 |
  ║ 7 ║ 8 │ 9 ║  │ 7 │ 8 │ 9 │  | 7 | 8 | 9 |
  ╚═══╩═══╧═══╝  └───┴───┴───┘  +---+---+---+

"/BUFF=n" specifies how many spaces to position on either side of the vertical bars in the tables. Defaults to /BUFF=1.

"/SPACES" retains extra vertical spacing between sections. There are frequently lots of extra blank lines that appear in the output file either due to specific HTML requests or to ensure proper reformatting. Specifying /SPACES allows these to stay there.

"/-SPACES" removes these extra blank lines. This also tries to remove empty columns in tables as well as some blank rows in tables. This is initially the default.

"/RSPACE" requires that a blank line appear before and after horizontal rule (<HR>) indicators. Using this option with /SPACES may cause multiple blank lines around horizontal rules. Initially defaults to "/-RSPACE".

"/-RSPACE" doesn't force a blank line around horizontal rule indicators. This is initially the default.
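
For readers who want to see the /BORDER=T and /BUFF=n descriptions above spelled out, here is a minimal Python sketch. It is not HTMSTRIP's own code; the function name render_text_table and its buff argument are made up for illustration, and it only covers the plain "text character" border with left-aligned cells.

```python
def render_text_table(rows, buff=1):
    """Draw rows of cell strings with +---+ borders, like the <T>ext example."""
    ncols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    grid = [list(row) + [""] * (ncols - len(row)) for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[c]) for row in grid) for c in range(ncols)]
    pad = " " * buff  # hypothetical stand-in for /BUFF=n
    rule = "+" + "+".join("-" * (w + 2 * buff) for w in widths) + "+"
    lines = [rule]
    for row in grid:
        cells = [pad + cell.ljust(w) + pad for cell, w in zip(row, widths)]
        lines.append("|" + "|".join(cells) + "|")
        lines.append(rule)
    return "\n".join(lines)


print(render_text_table([["1", "2", "3"], ["4", "5", "6"], ["7", "8", "9"]]))
```

With buff=1 this prints the same grid as the <T>ext column in the examples; a larger buff simply widens every cell by that many spaces on each side.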
"/WARNINGS" displays on-screen warnings when HTMSTRIP finds either internal problems in the document or things it can't handle. Realistically, they're not all that important because the program is working around them anyway but you might want to use them to help make suggestions to the webmaster. If you create a logfile (using the "/LOG=filename" parameter), the warnings are automatically written out to that file independently of the "/WARNINGS" setting. Initially defaults to "/-WARNINGS".

"/-WARNINGS" turns off the warning messages. This is initially the default.

"/TABLE" says to process text within table declaration sections as tables whenever the program can. There are some maximum cell length limits in the program and some tabular text will be dumped as straight ASCII text anyway. This is initially the default.

"/-TABLE" says to process text within table declaration sections as straight text, removing it from the tabular structure entirely. This helps in cases where page authors have switched to tables for formatting purposes and the resulting pages when converted to text are mostly spaces. Finally, using /-TABLE can sometimes avoid "out of string space" errors that pop up on some pages. Initially defaults to "/TABLE".

"/-INDENT" removes block indent sections from the output file. By default, five spaces are inserted before each line within a <BLOCKQUOTE>...</BLOCKQUOTE> block. These can be nested so you can end up with a lot of white space in your document. "/-INDENT" turns them off. Initially defaults to "/INDENT".

"/INDENT" retains the <BLOCKQUOTE>...</BLOCKQUOTE> indenting. This is initially the default.

"/CPn" specifies what character pageset to use. "n" can be 1, 2, or 3:

     /CP1 specifies that the program should use the 7-bit DOS character set. This is the most universally recognized character set out there and should work for printing, e-mail, etc.
It does not handle foreign characters or miscellaneous symbols like "£" so these are translated into rough ASCII equivalents. Since this is the lowest-common-denominator font, it's initially the default for this routine. Add /CP2 or /CP3 to your HTMSTRIP.INI file if you want to change on a regular basis.

     /CP2 specifies that the program should use the 8-bit DOS character set. This works within DOS applications but doesn't read correctly under Windows programs.

     /CP3 specifies that the program should use the ISO 8859/1 8-bit single-byte graphic character set. This set works within Windows applications but may not e-mail correctly.

"/A=spec" tells the program how to handle <A...> hypertext links. These are used when the program is supposed to hop to another HTML page or to a section within the current HTML page. The values of "spec" are mutually exclusive:

     /A=FSITE says to show the site name, using its full url address, and embed this name in the body of the text page

     /A=FSITEFN says to show the site name, using its full url address, and place this site name in a footnote section at the end of the text page

     /A=SITE says to show the site name, but only the part after the last "/" or "\", and embed this name in the body of the text page

     /A=SITEFN says to show the site name, but only the part after the last "/" or "\", and place this site name in a footnote section at the end of the text page

     /A=SYMBOL says to use the specified <A> symbol (initially defined as "(link)" in the HTMSTRIP.INI file)

     /A=NONE (or /-A) says that nothing is to be shown for hypertext links. This is initially the default.

"/IMG=spec" tells the program how to handle <IMG...> links. These are used for embedded graphics. The values of "spec" are mutually exclusive and are documented in the "/A=spec" section above. Initially defaults to "/IMG=NONE" (which is the same as "/-IMG") which will result in nothing being shown for the image links.
Given:

     <IMG SRC="../movies/Anaconda/assets/title.gif" border=0 alt="Anaconda - click to enter">

     Setting         Yields
     -------         ------
     /IMG=FSITE      [../movies/Anaconda/assets/title.gif]
     /IMG=FSITEFN    [1] ../movies/Anaconda/assets/title.gif (footnote)
     /IMG=SITE       [title.gif]
     /IMG=SITEFN     [1] title.gif (footnote)
     /IMG=SYMBOL     (link)
     /IMG=NONE       (is not shown)

"/IMGALT=spec" is identical to "/IMG=spec". However, if "/IMGALT=spec" is specified (and is not "/IMGALT=SYMBOL" or "/IMGALT=NONE"), the program will look for an ALT=alias reference in the <IMG...> link and use that if found. Note that the alias will be used in its entirety if it's found and it will be embedded in the output text (appearing within brackets). The "spec" items are used for any reference that doesn't have an ALT=spec specification; in this case, the program works identically to "/IMG=spec" for these. So site names might be tossed at the bottom as footnotes if "/IMGALT=SITEFN" or "/IMGALT=FSITEFN" is used but any ALT=spec items are always in the text itself. Initially defaults to "/IMGALT=NONE" (same as "/-IMGALT") which will result in nothing being shown for the image links.

Given:

     <IMG SRC="../movies/Anaconda/assets/title.gif" border=0 alt="Anaconda - click to enter">

     Setting            Yields
     -------            ------
     /IMGALT=FSITE      [Anaconda - click to enter]
     /IMGALT=FSITEFN    [Anaconda - click to enter] (*not* footnote)
     /IMGALT=SITE       [Anaconda - click to enter]
     /IMGALT=SITEFN     [Anaconda - click to enter] (*not* footnote)
     /IMGALT=SYMBOL     (link)
     /IMGALT=NONE       (nothing shown)

"/ALTONLY" specifies that if an ALT=alias reference exists in an <IMG...> link, then the alias should be embedded in the output text (appearing within brackets) but, otherwise, all <IMG...> references are to be ignored in the input file. Initially defaults to "/-ALTONLY".

"/-ALTONLY" allows <IMG...> references to be added to output file even if an ALT=alias reference is not specified. This is initially the default.
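
The /IMG and /IMGALT tables above boil down to a small decision procedure, which the following Python sketch walks through. This is only an illustration of the documented behaviour, not how HTMSTRIP is actually written: the function img_text, its arguments, and the regular expression used to pick out quoted attributes are all assumptions, and the symbol text is passed in by the caller rather than read from HTMSTRIP.INI.

```python
import re


def img_text(tag, spec, footnotes, use_alt=False, symbol="(image)"):
    """Return the text that stands in for one <IMG ...> tag.

    spec      -- FSITE, FSITEFN, SITE, SITEFN, SYMBOL or NONE
    footnotes -- list that collects footnote entries for the *FN settings
    use_alt   -- True mimics /IMGALT=spec, False mimics /IMG=spec
    """
    # Only quoted attributes are picked up; unquoted ones are ignored here.
    attrs = {k.lower(): v for k, v in re.findall(r'(\w+)\s*=\s*"([^"]*)"', tag)}
    spec = spec.upper()
    if spec == "NONE":
        return ""
    if spec == "SYMBOL":
        return symbol
    if use_alt and "alt" in attrs:
        return "[" + attrs["alt"] + "]"       # ALT text always goes in the body
    site = attrs.get("src", "")
    if spec in ("SITE", "SITEFN"):
        site = re.split(r"[/\\]", site)[-1]   # keep the part after the last / or \
    if spec.endswith("FN"):
        footnotes.append(site)                # listed at the end of the page
        return "[%d]" % len(footnotes)
    return "[" + site + "]"


tag = '<IMG SRC="../movies/Anaconda/assets/title.gif" border=0 alt="Anaconda - click to enter">'
notes = []
print(img_text(tag, "SITE", notes))                  # [title.gif]
print(img_text(tag, "SITEFN", notes), notes)         # [1] ['title.gif']
print(img_text(tag, "SITEFN", notes, use_alt=True))  # [Anaconda - click to enter]
```

The same shape would apply to /A=spec and /MAP=spec, with HREF taking the place of SRC.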
"/MAP=spec" and "/MAPALT=spec" work the same as "/IMG=spec" and "/IMGALT=spec" do but they apply to <AREA> specifications within a <MAP>...</MAP> block. Initially defaults to "/MAP=NONE" (which is the same as "/-MAP"). "/-INPUT" skips any indication of the <INPUT> flags. Initially defaults to "/INPUT". "/INPUT" shows <INPUT> flags. This allows the "<INPUT> = 5<@+>" (or however you have it defined) from HTMSTRIP.INI to be activated. This is initially the default. "/L" says to read "&xxx;" entity references and "<A>" etc symbol lookup codes from your /Iinitfile file. This is initially the default. "/Linitfile" says to read the "&xxx;" entity references and "<A>" etc symbol lookup codes from the specified file "initfile". Specifying another file is primarily useful if you want to have a master *.INI file and a separate code lookup table. Initially defaults to "/L". "/-L" says to not process any entity references or symbol lookup codes. Initially defaults to "/L". HTMSTRIP.DOC 12 Jul 31, 1997 "/FILTER" specifies that the program is to replace specific characters in the input file. See the "Defining Character-Translations" discussion below. When this parameter is in effect, the program looks for character translations in the entity reference file (/Linitfile), which typically defaults to your initialization file (/Iinitfile). The is initially the default. "/FILTER=filename" specifies that a filter is to be applied and all character replacements are in the file "filename". See the "Defining Character-Translations" discussion below. "/-FILTER" says to not bother removing the nonprintable characters from the output. Initially defaults to "/FILTER". "/LOG=logfile" specifies that the program should create a simple log file showing what files were processed when and what (if any) errors were encountered. If the logfile exists already, it will be appended to (lines will be added to the end of it). 
If no drive or path is specified, the file will be created in your default drive or path. Initially defaults to "/-LOG" (don't create a logfile).

"/-LOG" says to not create a log file at all. This is initially the default.

"/LOG" is the same as "/LOG=HTMSTRIP.LOG".

"/Tpath" specifies where to write the temporary files that the routine needs. Examples are "/TC:" and "/TC:\TEMP". If not specified, the routine writes to the following in sequence:

     - the value of any TEMP, then TMP, environmental variable
     - C:\TEMP
     - C:\

"/MONO" (or "/-COLOR") does not try to override screen colors. Initially defaults to "/COLOR".

"/COLOR" (or "/-MONO") allows screen colors to be overridden. This is initially the default.

"/Iinitfile" says to read an initialization file with the file name "initfile". The file specification *must* contain a period. If no drive or path information is specified, the program will search for initfile beginning in your default subdirectory and then going throughout your DOS path. The use of an initialization file is optional. Initially defaults to "/IHTMSTRIP.INI".

"/-I" (or "/INULL") says to skip loading the initialization file. Note that this also drops loading the file that translates things like "&xxx;" so you should specify /Linitfile if you drop the other file.

"/ENV" says to look for %var% occurrences in the command line and try to resolve any apparent environmental variable references. See BRUCEINI.DOC for more information. This is initially the default.

"/-ENV" says to skip resolving apparent %var% occurrences in the command line. Initially defaults to "/ENV".

"/?" or "/HELP" or "HELP" shows you the syntax for the command.

"/?&H" gives you a hexadecimal and decimal conversion table.

Return codes:

HTMSTRIP returns the following ERRORLEVEL codes:

       0 = no problems, all files processed
     251 = could not find a file to process
     253 = operation aborted by pressing Escape
     255 = syntax problems, or /? requested

Defining entity references:

HTMSTRIP will process an entity reference definition file if one is found. This table can be in your standard *.INI file (for example, HTMSTRIP.INI) if desired or it can be a separate file specified using the /Linitfile parameter.

Entity references are how non-standard characters like the copyright character are handled in HTML pages. Entity references are indicated as "&xxx;" where "xxx" is either a code or a number preceded by a pound sign. The copyright symbol is indicated in HTML as "&copy;". A default HTMSTRIP.INI is provided with over 300 entity reference lookups.

To define or change these lookups, the INI file should include a series of lines in the following format:

     &xxx; = _outstr1_outstr2_outstr3_

where "&xxx;" is the HTML sequence and "outstr1", "outstr2", and "outstr3" are what you want to replace it with. There are three available lookup strings to match the setting for the character pageset parameter ("/CPn"):

  * The first character(s) ("outstr1") correspond to the characters used under 7-bit DOS (/CP1). Files created using this character set can be e-mailed to anyone and look identical under DOS and Windows. Foreign characters and symbols are translated into fairly boring, generic characters.

  * The second character(s) ("outstr2") correspond to the characters used under 8-bit DOS (/CP2). Files created using this character set look fine under DOS but look sick under Windows.

  * The third character(s) ("outstr3") correspond to the characters used under the ISO 8859/1 8-bit single-byte graphic character set (/CP3). Files created using this character set look fine under Windows but look bad under DOS.

For example:

     &AElig; = _AE_Æ_╞_

will use "AE" if /CP1 is in effect, "Æ" if /CP2 is in effect, and "╞" if /CP3 is in effect. Note that at least one of these "outstr" elements will look incorrect to you if you're viewing this help file under Windows or DOS.
See the discussion about ENTITY.HTM below in order to see how the different character sets are viewed under different environments.

In cases where the characters are identical between all character sets, you can just include the lookup once:

     &amp; = &

The same lookup value will be used regardless of what character set you're under.

The "outstr" portions can consist of regular non-space ASCII text characters (like "A" or "z") as well as hexadecimal values (in the form &Hxx) or decimal values (in the form \nnn). (See the BRUCEHEX.DOC file.) They can also be the word "NULL" which translates the string into nothing. You cannot use a space or equal sign in "outstr"; use the hexadecimal or decimal representations instead.

The table does not have to be in any specified order. Lines can end with "/*" followed by a comment if you want. Examples:

     &cent; = _cents_¢_ó_       /* Cent symbol
     &copy; = _(c)_(c)_⌐_       /* Copyright symbol
     &deg;  = _degree_°_░_      /* Degree symbol
            = \032              /* Thick space

Remember that "&xxx;" entity references (yes, I hate that phrase) are case-sensitive in HTML. "&deg;" will not find "&Deg;".

There seems to be a trend of late to relax some of the replacement coding requirements in Web pages. The ";" is now, apparently, becoming optional. Numeric replacements (for example, " ") seem to no longer require the leading pound sign. Therefore, HTMSTRIP looks for both of these variations for any appropriate lookup. "&copy;" will find "&copy" and "&#153;" will find "&153". The lookup itself has to be entered in the formally correct way though.

You can see how these files will be processed under each character pageset by testing out the ENTITY.HTM file that is provided with the HTMSTymm.ZIP file. This contains all of the entity references defined in HTMSTRIP.INI as of March 1997.
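
As a rough guide to the lookup format and the relaxed matching described above, here is a small Python sketch. It is a guess at the logic from this documentation only, not HTMSTRIP's source. The names decode, parse_entity_line, and find_entity are invented, the underscore is assumed to be the only delimiter, \nnn is assumed to be exactly three digits, and chr() is used as a stand-in for real code-page bytes.

```python
import re


def decode(outstr):
    # Expand NULL, \nnn (decimal) and &Hxx (hexadecimal) in one outstr.
    if outstr.upper() == "NULL":
        return ""
    outstr = re.sub(r"\\(\d{3})", lambda m: chr(int(m.group(1))), outstr)
    return re.sub(r"&H([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)), outstr)


def parse_entity_line(line):
    # Return (entity, [cp1, cp2, cp3]) for one lookup line.
    line = line.split("/*")[0]                 # drop a trailing comment
    entity, value = (part.strip() for part in line.split("=", 1))
    pieces = value.split("_")
    if len(pieces) == 5 and pieces[0] == "" and pieces[-1] == "":
        outs = pieces[1:4]                     # _outstr1_outstr2_outstr3_
    else:
        outs = [value] * 3                     # one value for all character sets
    return entity, [decode(o) for o in outs]


def find_entity(table, text):
    # Look up text as found in a page, tolerating a missing ";" or "#".
    candidates = [text, text + ";"]
    if re.fullmatch(r"&\d+;?", text):          # "&153" stands for "&#153;"
        fixed = "&#" + text[1:]
        candidates += [fixed, fixed + ";"]
    for cand in candidates:
        if cand in table:                      # exact match, so case-sensitive
            return table[cand]
    return None


table = dict(parse_entity_line(line) for line in [
    r"&copy; = _(c)_(c)_\169_   /* Copyright symbol",
    "&amp; = &",
])
cp = 1                                         # as if /CP1 were in effect
print(find_entity(table, "&copy")[cp - 1])     # (c)
```

The table keys stay in their formally correct form ("&copy;", "&#153;"); only the text found in the page is allowed to be sloppy, which matches the rule that the lookup itself must be entered correctly.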
To try all three of the character sets, issue the following commands:

    HTMSTRIP ENTITY.HTM ENTITY.DOS /CP1
    HTMSTRIP ENTITY.HTM ENTITY.IBM /CP2
    HTMSTRIP ENTITY.HTM ENTITY.WIN /CP3

Then view the resulting files under the DOS EDIT command as well as under the Windows Write program.

Defining the Symbolic References:

You are also allowed to redefine the strings that are used for several symbolic references in the entity reference file. For example, if your source code contains an <IMG> (inline image) reference, HTMSTRIP can indicate this by putting some text in place of the image. (HTMSTRIP is text only so it's not going to put the actual image in there.)

The first three replacements shown below are conditional based on other parameters:

* The <A> indicator replaces hyperlink references if /A=SYMBOL is specified.

* The <IMG> indicator replaces inline image references if /IMG=SYMBOL or /IMGALT=SYMBOL is specified.

* The <INPUT> indicator replaces input fields if /INPUT is left as the default.

* <I> replaces italics-on and </I> replaces italics-off.

* <U> replaces underline-on and </U> replaces underline-off.

* <B> replaces bold-on and </B> replaces bold-off.

* <EM> replaces emphasis-on and </EM> replaces emphasis-off.

* <TITLE> ... </TITLE> indicates how to handle the document's title.

* <H1> ... </H1> indicates how to handle the level 1 headings. Similarly, <H2> ... </H2> through <H6> ... </H6> indicate how to handle those levels of headings.

The default indicators are the following:

    Symbol                    Meaning               Default Value
    <A>                       hyperlinks            -> (link)
    <IMG>                     inline image          -> (image)
    <INPUT>                   input fields          -> 5<@+>
    <I> and </I>              italics on/off        -> (null)
    <U> and </U>              underline on/off      -> (null)
    <B> and </B>              bold on/off           -> (null)
    <EM> and </EM>            emphasis on/off       -> (null)
    <TITLE> and </TITLE>      document title        -> (null)
    <H1> through <H6> and
      </H1> thru </H6>        level headings        -> (null)

You can redefine any and all of these symbolic references in the same lookup file.
These substitutions are specified more or less like the previous substitutions. For example:

    <A> = (link)
    <IMG> = (image)
    <INPUT> = 5<@+>
    <U> = _
    </U> = _
    <B> = *
    </B> = *

Unlike with the other lookups, the left side is not case sensitive so "<a>=(link)" works just fine. Hexadecimal and decimal replacements are again acceptable (see the BRUCEHEX.DOC file). You might, for example, want to redefine some of them like this:

    <A> = \251        /* Replaces with a √ symbol
    <IMG> = \015      /* Replaces with a symbol (little flash cube)
    <INPUT> = ?       /* Replaces with a question mark

The replacements aren't always perfect. Web browsers don't visibly italicize or bold spaces, so the following will look perfectly fine under Netscape or Internet Explorer:

    The<B> Minnow </B>was Gilligan's ship.

However, if you have the following in your INI file:

    <B> = *
    </B> = *

the text will show up as:

    The* Minnow *was Gilligan's ship.

This makes it look like the wrong words are emphasized. This is unfortunate but it's the way things work.

If you normally print the results of everything from HTMSTRIP, you can probably find the print codes that are appropriate for your printer that will change the text in the way you want. For example, if you're using a Hewlett-Packard LaserJet printer, the User's Manual lists printer codes for different types of bolding, underlining, etc. You have to make sure that you turn off the settings with the </xx> option (e.g. </B>) though.
The following should work on many HP LaserJets (check your manual and replace with the appropriate codes if not):

    <I> = \027(s1S     /* Turns italicizing on
    </I> = \027(s0S    /* Turns italicizing off (restores upright)
    <U> = \027&d0D     /* Turns underlining on
    </U> = \027&d@     /* Turns underlining off
    <B> = \027(s2B     /* Turns demi-bolding on
    </B> = \027(s0B    /* Turns bolding off
    <EM> = \027(s1B    /* Turns semi-bolding on
    </EM> = \027(s0B   /* Turns bolding off (restores normal weight)

Note that the program counts all characters (including these special print-setting characters which don't themselves print) when it reflows text. Also note that, on the HP at least, underlining underlines spaces as well as characters, including indents.

Any symbolic references that you do not redefine will default to their original values.

The <INPUT> item is a bit of a special case. It has several special options, and they are all present in the default value. <INPUT> is used to indicate that the HTML page prompted for, typically, a bit of text. In the actual HTML page, this might be coded as:

    <INPUT NAME=q size=45 maxlength=200 VALUE="">

Ignoring most of the parameters, the "size=45" parameter says that the Web navigator is to present an input line to the user which is 45 characters in length. "VALUE=""" indicates that no default value is provided for this input.
The default symbolic reference for handling an <INPUT> request is:

    <INPUT> = 5<@+>

Each item of the assignment is explained below:

    <INPUT>    specifies the <INPUT> replacement

    5          means the maximum input length (SIZE=x) to be provided is 5 characters; the value can be any number between 1 and 255; this rule is sometimes waived (see below)

    < and >    are extra text characters that will appear

    @          says to fill in the default value (VALUE="" above) if one is provided

    +          says to expand the input field based on a specified length (SIZE=45 above); if no SIZE= is provided on the page, a default of SIZE=5 will be used; expansion is done using underscore characters

With the above settings, if the program encountered this:

    <INPUT NAME=q size=45 maxlength=200 VALUE="">

it would actually write out the input reference as:

    <___>

Similarly, if the program encountered this:

    <INPUT TYPE=submit VALUE=Submit>

it would write out this:

    <Submit>

On the other hand, the program will expand the field beyond the specified maximum length if "@" (value) is requested and the value is too large to fit in the specified field length. If the program encountered this:

    <INPUT TYPE=TEXT VALUE="This is my sample" SIZE=10>

it would write out this:

    <This is my sample>

Defining Character-Translations (The Filter Table):

HTMSTRIP allows you to translate specified characters as the text is read. This is useful on output for characters that are defined under Windows but that's about it. This should not be an issue because HTML is supposed to be platform independent; the Web designer (or the software used for the page) should have been smart enough to insert the proper entity reference instead.

For example, "DisneyÆs" shows up on the Disney site for some reason. The filter table will translate this as "Disney's". Also, way too many Web designers use decimal 169 ("⌐", as in "⌐ 1996") as a copyright symbol; they're supposed to use the entity reference &copy; instead.
The filter table will translate this as "c 1996".

There is a default character-translation table built into the entity lookup file (HTMSTRIP.INI). This will typically be loaded automatically by the program. You can update the translations in the lookup file or you can create your own filter file and invoke it by specifying the "/FILTER=filename" parameter. In most cases, however, you will not need to modify this table.

The filter table is an ASCII text file which consists of a series of lines in the following format:

    inchar = outchar

where "inchar" is the character to change from and "outchar" is what to change the character to. Both portions can consist of regular non-space ASCII text characters (like "A" or "z") as well as hexadecimal values (in the form &Hxx) or decimal values (in the form \nnn). Both sides must reference a single character (exactly one character is always translated into exactly one character). You cannot use a space or equal sign in either "inchar" or "outchar"; use the hexadecimal or decimal representations instead.

The table does not have to be in any specified order. Lines can end with "/*" followed by a comment if you want. Hexadecimal and decimal equivalents are explained in BRUCEHEX.DOC.

Examples:

    a = A          /* Translate lowercase "a" into capital "A"
    \032 = _       /* Translate space (decimal 032, &H20 too) into underscore
    \027 = \032    /* Translate escape character to a space

Some leading characters in INI files are treated specially within Wayne Software programs. INI lines that begin with any of the following characters may lead to odd results: "[", "/", "&", "\", ";", ":", "<", and ",". To avoid problems, use hexadecimal or decimal representations for these characters. For example, use \047 or &H2F if you want to override the definition of "/".

Author:

This program was written by Bruce Guthrie of Wayne Software.
It is free for use and redistribution provided relevant documentation is kept with the program, no changes are made to the program or documentation, and it is not bundled with commercial programs or charged for separately. People who need to bundle it in for-sale packages must pay a $50 registration fee to "Wayne Software" at the following address.

Additional information about this and other Wayne Software programs can be found in the file BRUCE.DOC which should be included in the original ZIP file. The recent change history for this and the other programs is provided in the HISTORY.ymm file which should be in the same ZIP file where "y" is replaced by the last digit of the year and "mm" is the two digit month of the release; HISTORY.611 came out in November 1996. This same naming convention is used in naming the ZIP file (HTMSTymm.ZIP) that this program was included in.

Comments and suggestions can also be sent to:

    Bruce Guthrie
    Wayne Software
    113 Sheffield St.
    Silver Spring, MD 20910

    e-mail: WayneSof@erols.com
    fax: (301) 588-8986
    http://www.geocities.com/SiliconValley/Lakes/2414

Please provide an Internet e-mail address on all correspondence.